Abstract

Two new proofs of the Fisher information inequality (FII) using data processing
inequalities for mutual information and conditional variance are presented.
1 Introduction

In parameter estimation problems, the Fisher information matrix of a measurement $X$
relative to a vector parameter $\theta$ is defined as

$$J(X;\theta) \triangleq \mathrm{COV}\left\{\frac{\partial}{\partial\theta}\ln f_\theta(X)\right\} \tag{1}$$

where $\{f_\theta(x)\}$ is a family of probability density functions of $X$ parameterized by
$\theta$, and $\mathrm{COV}\{\cdot\}$ denotes the covariance matrix calculated with respect to
$f_\theta(x)$. A special form of the Fisher information matrix that shows up regularly in
information theory and physics [2] is the Fisher information matrix of a random vector
with respect to a translation parameter:

$$J(X) \triangleq J(\theta + X;\theta) = \mathrm{COV}\{\rho_X(X)\} \tag{2}$$

where the score function $\rho_X$ is defined as

$$\rho_X(x) \triangleq \frac{\partial}{\partial x}\ln f(x), \tag{3}$$

and $f(x)$ is the probability density function of the random vector $X$. Unlike in the
general definition (1), this special form of the Fisher information matrix is a function
of the probability density of the random vector alone, and not of its parametrization.



Let $N_1$ and $N_2$ be two independent random variables with probability density
functions on the real line $\mathcal{R}$. The classical Fisher information inequality (FII)
states that

$$(a+b)^2 J(N_1+N_2) \le a^2 J(N_1) + b^2 J(N_2), \quad \forall\, a, b \ge 0. \tag{4}$$

Choosing $a = 1/J(N_1)$ and $b = 1/J(N_2)$, we obtain from (4) that

$$\frac{1}{J(N_1+N_2)} \ge \frac{1}{J(N_1)} + \frac{1}{J(N_2)}, \tag{5}$$

where the equality holds if and only if both $N_1$ and $N_2$ are Gaussian. Compared with
(4), (5) is usually thought of as the canonical form of the classical FII.
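
The step from (4) to (5) can be spelled out as follows. This is only a sketch, and it assumes that $J(N_1)$, $J(N_2)$ and $J(N_1+N_2)$ are finite and strictly positive, so that the chosen $a$ and $b$ are well defined and reciprocals can be taken.

```latex
\begin{align*}
a = \frac{1}{J(N_1)},\quad b = \frac{1}{J(N_2)}
  \;&\Longrightarrow\; a^2 J(N_1) + b^2 J(N_2) = a + b, \\
(a+b)^2 J(N_1+N_2) \le a + b
  \;&\Longrightarrow\; J(N_1+N_2) \le \frac{1}{a+b}, \\
  \;&\Longrightarrow\; \frac{1}{J(N_1+N_2)} \ge a + b
     = \frac{1}{J(N_1)} + \frac{1}{J(N_2)} .
\end{align*}
```

For Gaussian $N_1$ and $N_2$ with variances $s_1^2$ and $s_2^2$ (symbols introduced here for illustration), $J(N_i) = 1/s_i^2$ and $J(N_1+N_2) = 1/(s_1^2+s_2^2)$, so both sides of the last line equal $s_1^2 + s_2^2$, in agreement with the equality condition.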

The classical Stam-Blachman proof [?] of the FII relies on the following
conditional-mean representation of the score function for the sum of two independent
random variables:

$$\rho_{N_1+N_2}(n) = E\left[\rho_{N_i}(N_i) \mid N_1 + N_2 = n\right], \quad i = 1, 2, \tag{6}$$

and then applies the Cauchy-Schwarz inequality. Although the proof is direct and
concise, it does not bring any operational meaning to the FII.

In the excellent contribution [?], Zamir showed that the FII can be proved using
the following data processing inequality for Fisher information:

$$J(Y;\theta) \le J(X;\theta) \tag{7}$$

if $\theta \to X \to Y$ satisfies the chain relation

$$f(x, y \mid \theta) = f_\theta(x)\, f(y \mid x), \tag{8}$$

i.e., the conditional distribution of $Y$ given $X$ is no longer a function of the parameter
$\theta$. In [?], Zamir considered the parameter estimation model:

$$\begin{aligned} X_1 &= a\theta + N_1 \\ X_2 &= b\theta + N_2 \end{aligned} \tag{9}$$

where $a$ and $b$ are two arbitrary nonnegative real numbers. Note that
$\theta \to (X_1, X_2) \to X_1 + X_2$ satisfies the chain relation (8) in a trivial way. By the
data processing inequality (7) for Fisher information, we have

$$J(X_1 + X_2;\theta) \le J(X_1, X_2;\theta). \tag{10}$$

Thus, the desired inequality (4) can be obtained by substituting the parameter
estimation model (9) into (10). Moreover, it can be seen that the difference between the
two sides of (10) corresponds to the loss in the Cramér-Rao bound due to the "processing"
in a certain linear additive noise model for parameter estimation.
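
The substitution of the model $X_1 = a\theta + N_1$, $X_2 = b\theta + N_2$ into the data processing inequality can be made explicit. The sketch below assumes that $N_1$ and $N_2$ are independent with differentiable densities $f_{N_1}$, $f_{N_2}$ and $f_{N_1+N_2}$ (symbols introduced here for illustration), and uses the fact that score functions have zero mean.

```latex
\begin{align*}
% Pair (X_1, X_2): the joint density factors because N_1 and N_2 are independent
\frac{\partial}{\partial\theta}
  \bigl[\ln f_{N_1}(X_1 - a\theta) + \ln f_{N_2}(X_2 - b\theta)\bigr]
  &= -a\,\rho_{N_1}(N_1) - b\,\rho_{N_2}(N_2), \\
J(X_1, X_2;\theta) &= a^2 J(N_1) + b^2 J(N_2), \\
% Sum: X_1 + X_2 = (a+b)\theta + (N_1 + N_2)
\frac{\partial}{\partial\theta}
  \ln f_{N_1+N_2}\bigl(X_1 + X_2 - (a+b)\theta\bigr)
  &= -(a+b)\,\rho_{N_1+N_2}(N_1 + N_2), \\
J(X_1 + X_2;\theta) &= (a+b)^2 J(N_1 + N_2).
\end{align*}
```

Inserting the second and fourth lines into $J(X_1+X_2;\theta) \le J(X_1,X_2;\theta)$ gives $(a+b)^2 J(N_1+N_2) \le a^2 J(N_1) + b^2 J(N_2)$.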

It is worth mentioning that an identical argument without assuming the
independence between $N_1$ and $N_2$ proves a generalization of the classical FII to the
dependent-variable case:

$$(a+b)^2 J(N_1 + N_2) \le [a \;\; b]\, J(N_1, N_2)\, [a \;\; b]^{t}. \tag{11}$$

This result was initially proved in [?, Th. 2] using the conditional-mean representation
of the score function for the sum of two dependent random variables.



2 New Proofs of the FII 

Data processing is a general principle in information theory, in that any quantity under 
the name "information" should obey some sort of data processing inequality. In this 
sense, Zamir's data processing inequality for Fisher information merely pointed out 
the fact that Fisher information bears the real meaning as an information quantity. 
Interestingly enough, at the very beginning of [?], Zamir also pointed out that the
data processing principle applies to mutual information and conditional variance as
well. Specifically, if random variables $W \to X \to Y$ form a Markov chain, the mutual
information among them satisfies

$$I(W;Y) \le I(W;X), \tag{12}$$

and the conditional variances satisfy

$$\mathrm{VAR}[W \mid Y] \ge \mathrm{VAR}[W \mid X] \tag{13}$$

where $\mathrm{VAR}[W \mid X] = E\left[(W - E[W \mid X])^2\right]$. The main purpose of this note is
to provide two new proofs of the FII using the more familiar data processing inequalities
(12) and (13), respectively.



2.1 A Communications Proof 

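Consider the communication model:

$$\begin{aligned} X_1 &= a\sqrt{t}\,W + N_1 \\ X_2 &= b\sqrt{t}\,W + N_2 \end{aligned} \qquad t > 0 \tag{14}$$

where $W$ is standard Gaussian and $W$, $N_1$ and $N_2$ are pairwise independent. By the
scalar De-Bruijn identity [1], we have

\begin{align}
I(W;X_1) &= \frac{a^2 t}{2}\,J(N_1) + o(t), \tag{15} \\
I(W;X_2) &= \frac{b^2 t}{2}\,J(N_2) + o(t), \tag{16} \\
\text{and}\quad I(W;X_1+X_2) &= \frac{(a+b)^2 t}{2}\,J(N_1+N_2) + o(t) \tag{17}
\end{align}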

where $o(t)/t \to 0$ in the limit as $t \downarrow 0$. Note that
$W \to (X_1, X_2) \to X_1 + X_2$ forms a trivial Markov chain. By the data processing
inequality (12) for mutual information, we have

\begin{align}
I(W;X_1+X_2) &\le I(W;X_1,X_2) \tag{18} \\
&= I(W;X_1) + I(W;X_2 \mid X_1) \tag{19} \\
&\le I(W;X_1) + I(W;X_2) \tag{20}
\end{align}

where the last inequality follows from $I(W;X_2 \mid X_1) \le I(W;X_2)$ because of the Markov
chain $X_2 \to W \to X_1$ [3, p. 33]. Substituting (15), (16) and (17) into (18)-(20), we obtain

$$(a+b)^2\, t\, J(N_1+N_2) \le a^2\, t\, J(N_1) + b^2\, t\, J(N_2) + o(t). \tag{21}$$

Dividing both sides by $t$ and letting $t \downarrow 0$, we obtain the desired inequality (4). This
completes the proof of the classical FII using the data processing inequality for mutual
information.
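
As a sanity check of the expansion (15), read as $I(W;X_1) = \tfrac{a^2 t}{2} J(N_1) + o(t)$, consider the special case where $N_1$ is Gaussian with variance $s_1^2$. This is an assumption made only for this example, and $s_1$ is a symbol not used in the proof. Then $X_1 = a\sqrt{t}\,W + N_1$ is the output of a Gaussian channel with signal power $a^2 t$, and both sides can be computed in closed form (mutual information in nats).

```latex
\begin{align*}
I(W;X_1) &= \frac{1}{2}\ln\!\left(1 + \frac{a^2 t}{s_1^2}\right)
          = \frac{a^2 t}{2 s_1^2} + o(t), \\
J(N_1)   &= \frac{1}{s_1^2}
  \quad\Longrightarrow\quad
  \frac{a^2 t}{2}\,J(N_1) = \frac{a^2 t}{2 s_1^2}.
\end{align*}
```

The first-order terms agree, as the identity predicts.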

Remark. The above proof still goes through without assuming that $W$ is Gaussian.
This is because, as mentioned in the last paragraph of [?, Sec. III], the scalar De-Bruijn
identity holds for any $W$ whose first four moments coincide with those of a Gaussian.

2.2 A Bayesian Estimation Proof 

Consider the Bayesian estimation model:

$$\begin{aligned} X_1 &= N_1 + \sqrt{at}\,W_1 \\ X_2 &= N_2 + \sqrt{bt}\,W_2 \end{aligned} \qquad t > 0 \tag{22}$$

where $W_1$ and $W_2$ are standard Gaussian random variables, and $N_1$, $N_2$, $W_1$ and $W_2$
are pairwise independent. The following lemma provides the needed connection between
conditional variance and Fisher information.

Lemma 1 Let $N$ and $W$ be two independent random variables. Assuming $W$ is Gaussian
with zero mean and variance $\sigma^2$, we have

$$J(N + W) = \frac{1}{\sigma^4}\left\{\sigma^2 - \mathrm{VAR}[N \mid N + W]\right\}. \tag{23}$$

Proof: See Appendix A.
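
Lemma 1 can be checked by hand in the jointly Gaussian case. The sketch below assumes, for illustration only, that $N$ is Gaussian with variance $s^2$ (a symbol introduced here) and that $W$ has variance $\sigma^2$; the lemma itself requires only $W$ to be Gaussian.

```latex
\begin{align*}
J(N+W) &= \frac{1}{s^2+\sigma^2}, \\
\mathrm{VAR}[N \mid N+W] &= \frac{s^2\sigma^2}{s^2+\sigma^2}, \\
\frac{1}{\sigma^4}\left\{\sigma^2 - \frac{s^2\sigma^2}{s^2+\sigma^2}\right\}
  &= \frac{1}{\sigma^4}\cdot\frac{\sigma^4}{s^2+\sigma^2}
   = \frac{1}{s^2+\sigma^2}.
\end{align*}
```

The last line matches the first, so the two sides of the lemma coincide in this special case.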

Remark. Our proof uses the conditional-mean representation of the score function
for the sum of two independent random variables, which suggests a natural connection
between conditional-mean estimators and Fisher information. We mention here that
Lemma 1 may also be deduced from a recent result of Guo, Shamai and Verdú [?,
Th. 1], in conjunction with the scalar De-Bruijn identity. Therefore, our proof can be
viewed as an alternative proof of the result of Guo, Shamai and Verdú.

Applying Lemma 1 to model (22), we obtain

\begin{align}
\mathrm{VAR}[N_1 \mid X_1] &= at - a^2 t^2 J(X_1), \tag{24} \\
\mathrm{VAR}[N_2 \mid X_2] &= bt - b^2 t^2 J(X_2), \tag{25} \\
\text{and}\quad \mathrm{VAR}[N_1+N_2 \mid X_1+X_2] &= (a+b)t - (a+b)^2 t^2 J(X_1+X_2). \tag{26}
\end{align}

Note that $N_1 + N_2 \to (X_1, X_2) \to X_1 + X_2$ forms a trivial Markov chain. By the data
processing inequality (13) for conditional variance, we have

\begin{align}
\mathrm{VAR}[N_1+N_2 \mid X_1+X_2] &\ge \mathrm{VAR}[N_1+N_2 \mid X_1, X_2] \tag{27} \\
&= \mathrm{VAR}[N_1 \mid X_1] + \mathrm{VAR}[N_2 \mid X_2] \tag{28}
\end{align}

where the last equality follows from the fact that $N_1$, $N_2$, $W_1$ and $W_2$ are pairwise
independent. Substituting (24), (25) and (26) into (27)-(28), we obtain

$$(a+b)^2 J(X_1+X_2) \le a^2 J(X_1) + b^2 J(X_2). \tag{29}$$
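
The algebra behind this last inequality is short. As a sketch, write each conditional variance in the form given by Lemma 1, with Gaussian noise variances $at$, $bt$ and $(a+b)t$ respectively, and compare the two sides of the data processing inequality for conditional variance:

```latex
\begin{align*}
(a+b)t - (a+b)^2 t^2 J(X_1+X_2)
  &\ge \bigl[at - a^2 t^2 J(X_1)\bigr] + \bigl[bt - b^2 t^2 J(X_2)\bigr] \\
\Longleftrightarrow\quad
-(a+b)^2 t^2 J(X_1+X_2) &\ge -a^2 t^2 J(X_1) - b^2 t^2 J(X_2) \\
\Longleftrightarrow\quad
(a+b)^2 J(X_1+X_2) &\le a^2 J(X_1) + b^2 J(X_2), \qquad t > 0 .
\end{align*}
```

The linear terms $(a+b)t$ cancel exactly, and dividing by $-t^2 < 0$ reverses the inequality.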



Note that $J(X_1)$, $J(X_2)$ and $J(X_1+X_2)$ approach $J(N_1)$, $J(N_2)$ and $J(N_1+N_2)$,
respectively, in the limit as $t \downarrow 0$. The desired inequality (4) thus follows from (29)
by letting $t \downarrow 0$. This completes the proof of the classical FII using the data
processing inequality for conditional variance.

Remark. As in Zamir's proof, the necessity of the equality condition in (5) does not
follow easily in either of the new proofs. As a matter of fact, it becomes less apparent
due to the limiting argument used in both proofs.

A Proof of Lemma 1

Let $X = N + W$. Since score functions always have zero mean, the Fisher information
of $X$ can be written as

$$J(X) = E\left[\rho_X^2(X)\right]. \tag{30}$$

By the conditional-mean representation of the score function for the sum of two
independent random variables, we have

\begin{align}
\rho_X(x) &= E\left[\rho_W(W) \mid X = x\right] \tag{31} \\
&= \frac{1}{\sigma^2}\,E\left[-W \mid X = x\right] \tag{32} \\
&= \frac{1}{\sigma^2}\left\{E[N \mid X = x] - x\right\} \tag{33}
\end{align}

where (32) holds because $W$ is Gaussian with zero mean and variance $\sigma^2$, so we have
$\rho_W(w) = -w/\sigma^2$, and (33) uses $W = X - N$. It then follows from (30) and (33) that
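\begin{align}
J(X) &= \frac{1}{\sigma^4}\,E\left[(E[N \mid X] - X)^2\right] \tag{34} \\
&= \frac{1}{\sigma^4}\,E\left[(W + (N - E[N \mid X]))^2\right] \tag{35} \\
&= \frac{1}{\sigma^4}\left\{\sigma^2 + \mathrm{VAR}[N \mid X] + 2E\left[W(N - E[N \mid X])\right]\right\}. \tag{36}
\end{align}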



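Further note that

\begin{align}
E\left[W(N - E[N \mid X])\right] &= -E\left[(N - X)(N - E[N \mid X])\right] \tag{37} \\
&= -E\left[\bigl((N - E[N \mid X]) + (E[N \mid X] - X)\bigr)(N - E[N \mid X])\right] \tag{38} \\
&= -\mathrm{VAR}[N \mid X] \tag{39}
\end{align}

because, by the orthogonality principle [10], we have
$E\left[(E[N \mid X] - X)(N - E[N \mid X])\right] = 0$.
Substituting (37)-(39) into (36), we obtain the desired representation

$$J(X) = \frac{1}{\sigma^4}\left\{\sigma^2 - \mathrm{VAR}[N \mid X]\right\}. \tag{40}$$

This completes the proof of Lemma 1.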